The sequencing of the human genome has fueled the last decade of work to functionally characterize genome content. An important
subset of genes encodes membrane proteins, which are the targets of many drugs. They reside in lipid bilayers, restricting their 
endogenous activity to a relatively specialized biochemical environment. Without a reference phenotype, the application of systematic 
screens to profile candidate membrane proteins is not immediately possible. Bioinformatics has begun to show its effectiveness in 
focusing the functional characterization of orphan proteins of a particular functional class, such as channels or receptors. Here we 
discuss integration of experimental and bioinformatics approaches for characterizing the orphan membrane proteome. By analyzing
the human genome, we provide a landscape reference for the human transmembrane genome.
Introduction

The availability of the human genome sequence [1, 2] provided
the blueprint for the diverse elements encoding the proteome. 
The exciting opportunity of comprehensively deciphering the 
function of these sequences remains a challenge. Tradition- 
ally, translating knowledge of a linear nucleic acid (amino 
acid) sequence into mechanistic insights requires a mixture 
of phenotypes obtained through genetic investigation, recon- 
stituted biochemical assays, and structural determination. 
Though for any gene these studies may prove technically 
challenging, they are particularly so for membrane proteins 
at the cell surface or in intracellular organelle bilayers. Mem- 
brane proteins include receptors, ion channels, transporters, 
and enzymes. Constituting a significant fraction (20%-30%)
of human genes [3], membrane proteins represent the targets of
over half of known drugs [4, 5]. As the lipid membrane of the
cell constitutes only 6%-12% of the cytosolic volume, with the
plasma membrane representing only 2%-5% of this total [6], the
biochemical environment necessary for transmembrane protein
function is highly specialized. Furthermore, the chemical
compositions of the two sides of the membrane are
physiologically different; a membrane protein is thus
theoretically situated in three biochemically distinct
environments. In addition to the critical requirement of a
lipid environment, functional characterization of membrane
proteins, like that of soluble proteins, faces other
challenges, including functional redundancy, macromolecular
organization, and dependence on physiological conditions [7-14].

Because of these challenges of characterizing a membrane 
protein, studies to understand the role of novel genes would 
benefit from the ability to narrow the potential number of can- 
didates. In one scenario, molecular determinants are sought 
for a specific physiological process or disease phenotype that 
is hypothesized to involve membrane receptors, such as ion 
flux. Here, 'de-orphanizing' involves finding those genes 
whose presence or function correlates with this phenotype, 
through reverse genetics, transcriptional profiling, and other 
methods [15-17]. Alternatively, the phenotype of interest may not
be known beyond a general category such as ion channel, and 
the challenge is to identify a plausible collection of uncharac- 
terized genes that may share general functional similarity with 
known families [18, 19]. Consequently, 'de-orphanization' may
also involve identifying the native ligand for novel receptors, 
ionic substrate for orphan channels or transporters, and
physiological protein-protein interactions, thereby helping to
define their functional phenotype [20-22]. In all cases, it is
helpful to leverage data on the functionally characterized portion
of the genome to infer the biological roles of the unannotated 
set based on existing information. Traditionally, this idea is 
demonstrated by the use of nucleic acid (amino acid) sequence 
similarity to infer possible functional homology. A popular
heuristic algorithm for this problem is the Basic Local
Alignment Search Tool (BLAST), which detects statistically
significant matches between a query sequence and a database using
a reference distribution of randomized sequence alignments as
the 'null' comparison [23]. More complex approaches have also
been proposed, such as hidden Markov models (HMMs), which
accommodate variations in insertion/deletion probability in
different domains of a protein, instead of the position-agnostic
gap penalty used by BLAST [24, 25]. Furthermore, innovations in
statistical 'machine learning' models allow sequence data to be
combined with other protein features and annotations to make
functional predictions. As more information from large-scale
functional and interaction studies becomes available, this kind
of data integration will likely play an increasing role in
prioritizing candidate lists of functionally uncharacterized
genes as potential molecular determinants for phenotypes of
interest.

In this perspective we review the roles that bioinformatics
can play in deorphanizing the uncharacterized membrane
proteins in the human genome. Figure 1 outlines this task,
which involves the two strategies described above. In the first
scenario, a phenotype of interest is known, and genome-wide
screens are used to generate candidate orphan genes that
may be the molecular determinants of the process of interest.
Here, bioinformatics approaches such as topology prediction
are used to filter the results, with overall predictive accuracy of
less concern than the detection of a single validated
determinant for the phenotype. Alternatively, the objective is to
identify the unknown function of these novel genes, with
their similarity to known proteins as the only reference. In the
first step of this class of investigation, genomic databases are
used as a basis for global prediction of membrane proteins
through topological models. Second, these membrane proteins are
further clustered into functionally related groups, based on
sequence homology, conserved motifs, and existing annotations.
As discussed in the following sections, this analysis has
traditionally consisted solely of in silico approaches, where
accuracy is judged through retrospective analyses predicting
the class of previously characterized proteins. However, one
may speculate that such methods could effectively narrow the
search space for novel membrane proteins in experimental
studies, particularly in cases where a phenotype of interest is
not known or well characterized. After reviewing examples
of the methodologies and results from both approaches, we
provide an analysis of the current landscape of characterized
and orphan membrane proteins in the human genome
that might be utilized as a broad guide for de-orphanization
efforts. Finally, we discuss future challenges, particularly in
integrating experimental and bioinformatics approaches in
cases where the phenotype of a novel transmembrane protein
is not known in advance.

Figure 1. Deorphanization strategies. Left (Blue): In silico analyses of genomic sequences, topological prediction, and functional prediction. Right (Red):
Phenotype of interest followed by genomic screen, bioinformatics evaluation of candidate list topology, and experimental validation.

Genomic prediction and validation of transmembrane 
protein function 

The ability to survey the expression and activity of a large
number of genes through microarrays, large-scale proteomics,
and functional genetics screens has greatly aided efforts
to molecularly characterize diseases and signaling
pathways [26-30]. Because these processes may involve cascades
that begin at the plasma or organelle membrane, transmem- 
brane proteins that are the primary drivers initiating these 
pathways may have a similar readout to downstream compo- 
nents in these assays. Thus, bioinformatics that can identify 
transmembrane proteins helps to narrow the number of candi- 
dates in which to invest follow-up experimental effort. More 
effectively, bioinformatics may even potentially focus the 
number of genes initially screened by identifying candidates 
using existing datasets for novel functions. Both cases are 
illustrated by recent examples summarized in Table 1. 

Table 1. Novel experimentally validated membrane proteins.

Gene        Methods                                      References
Letm1       Genome-wide RNAi, TM prediction              [15]
dBest1      Genome-wide RNAi, TM prediction              [31]
Piezo1, 2   Enhanced expression in N2A cells             [33]
TMEM16a     IL-4 induced gene expression in microarray   [16]
MCU         Phylogeny, microarray, mass spectrometry     [17]

The discovery of Leucine zipper-EF-hand containing
transmembrane protein 1 (Letm1) as a mitochondrial Ca2+/H+
antiporter demonstrates the use of bioinformatics to refine a
list of candidates from genomic functional assays [15]. To
identify proteins implicated in calcium transport across the inner
mitochondrial membrane, the authors used a genome-wide
RNA interference (RNAi) screen in Drosophila cells using
fluorescent calcium- and membrane potential-sensitive dyes to
identify genes whose loss affected the ion homeostasis of the
mitochondrial compartment. Having identified a list of
candidates, they further filtered the results to include only those
with predicted transmembrane segments, as soluble proteins
might be members of signaling pathways that indirectly
modulate but are not themselves directly implicated in ion
transport. A subsequent homology search for related mammalian
sequences identified Letm1 as a Ca2+/H+ antiporter.

Similarly, the chloride-conductive 'swell' Drosophila
Bestrophin 1 (dBest1) channel was identified using a fluorescent
anion-sensitive dye in a flux assay combined with RNAi
knockdown [31]. As with the Letm1 study, bioinformatics was
used to eliminate candidate regulators of cell volume
and chloride conductance that lacked predicted
transmembrane-spanning segments. A challenging aspect of this
study is that chloride channels, unlike better characterized
channels such as voltage-gated potassium channels [32], currently
lack a signature sequence motif that might help to restrict the
search space of possible membrane proteins involved in chloride
conductance [31]. Thus, an unbiased genomic screen using a very
specific phenotypic outcome was used to perform the bulk of 
candidate selection, with bioinformatics refining the hit list 
rather than defining the initial experimental scope. 

These two studies used functional genomics to identify can- 
didate genes whose loss is causally linked to the phenotype 
of interest. A related approach is to find genes that are cor- 
related, through expression level, with this phenotype. This 
method was used in one of three studies reporting discovery 
of the calcium-sensitive chloride channel transmembrane
protein 16 (TMEM16a) [11, 16, 19]. Here, the authors used microarray
analysis of bronchial epithelial cells, which display increased
calcium-activated chloride current following interleukin 4
(IL-4) treatment [16]. After genes differentially expressed
following IL-4 treatment were identified, topological
predictions were used to filter the hit list, guiding the
subsequent identification of TMEM16a [16]. A similar strategy of
identifying differentially expressed genes correlated with a
phenotype of interest was used to identify channels involved in
mechanosensation. Unlike other tissues examined by the authors,
mouse neuro 2a (N2A) neural crest cells displayed a
mechanosensitive current, leading to the hypothesis that
pressure-sensitive channels would be represented among
transcripts enriched in this cellular population [33].
Experimental studies of the resulting candidates identified
Piezo1 and Piezo2 as mechanosensitive channels [33]. As with
functional genomics approaches, the
success of these studies appears to require very specific phe- 
notypic queries that may be compared to large genomic space 
using profiling methodologies such as microarrays. 

An example of integrated genomic analysis is the discovery 
of the mitochondrial calcium uniporter component MCU [17].
Here, the authors leveraged previous mass spectrometric
profiling of the mitochondrial proteome [34], phylogenetic
conservation of genes along an evolutionary tree, and tissue
coexpression [35] to identify genes with similar profiles across
these three parameters compared to the uniporter regulator
mitochondrial calcium uptake 1 (MICU1) [17]. This analysis
identified MCU as a top candidate across all three parameters, 
a prediction verified by subsequent functional experiments. 
Unlike the previously described studies, bioinformatics played 
a key role in forming the initial 'hit list' for experimental vali- 
dation, rather than refining a list that was primarily generated 
through unbiased screening with reference to a phenotype 
of interest. Another striking example of this purely bioinfor- 
matic discovery is the identification of the Ciona intestinalis 
voltage-sensitive phosphatase (Ci-VSP) [18]. In a 'perfect storm'
of sequence homology, this gene was found to contain both 
a well-defined voltage sensor similar to ion channels and a 
phosphatase region. Thus, even though such a combination of 
modular units might not have been anticipated based on exist- 
ing knowledge of these two protein families, unbiased compu- 
tational screening allowed discovery of this novel transmem- 
brane protein. 

In most of the examples described, bioinformatic techniques 
have been utilized after unbiased, genome-wide analyses to 
filter candidate lists of potential membrane proteins
underlying a phenomenon of interest, rather than to identify an
initial, limited set for experimental evaluation. Also noteworthy
is the fact that many of these studies utilize differences in
tissue phenotypes, such as ionic currents sensitive to particular
stimuli, rather than computational motifs to identify candidate
genes. As noted above in the analysis of chloride channels, the
lack of well-defined functional motifs that might be used as
an in silico filter necessitates this sort of approach. However,
in the absence of a well-defined phenotype, how might novel
membrane proteins be prioritized for characterization? How
might the natural ligands, substrates and protein interaction
partners of otherwise well-characterized orphan proteins
be elucidated? We examine these questions by first describing
topological prediction algorithms, then methods for functional
inference.

Prediction of membrane proteins and topology 

The first level of discrimination in the bioinformatics analysis 
depicted in Figure 1 is to separate putative membrane proteins 
from soluble proteins. A number of algorithms have been 
reported for this task, as illustrated in Figure 2 and summa- 
rized in Table 2. 

Some of the early studies in this field identified simple and
effective heuristics for topology prediction. This is
demonstrated by 'rules-based' methods such as Topology
Prediction (TOPRED), which score each amino acid using the mean
hydrophobicity of its surrounding residues, and calculate
putative transmembrane regions and topology using the
'positive inside' rule [36], in which positively charged residues
have a bias to face the cytoplasm [37]. Thus, topological
predictions are generated in a manner analogous to a
Kyte-Doolittle plot [38], by finding a threshold for
hydrophobicity that will divide a protein's hydrophobicity
profile into transmembrane and cytosolic elements. Similarly,
alignment methods such as dense alignment surface (DAS) and
transmembrane multiple alignment prediction (TMAP) extend this
idea by seeking supporting information across multiple proteins:
they generate consensus dot plots comparing the hydropathy
profile of the protein of interest to a collection of background
reference sequences or to multiple sequence alignments with
homologs [39, 40].

Figure 2. Algorithms for topological and functional prediction. Primary amino acid sequence (top left) is employed to predict secondary structure
topology motifs (transmembrane helices, cytosolic loops, signal peptides) (top right), while secondary descriptors describing composition or substitution
patterns of amino acids (bottom) are used for functional prediction for membrane proteins.

Table 2. Algorithms for predicting membrane proteins.

Algorithm class          Examples                                                         References
Hidden Markov models     TMHMM, Phobius, SCAMPI, SPOCTOPUS, THUMBUP, HMMTOP1-2, TMpred    [43, 46-48, 85-88]
Neural network           MEMSAT3, SPOCTOPUS, PSIPRED, PHD                                 [47, 49, 89, 90]
Support vector machine   MEMSAT-SVM                                                       [50]
Consensus                BPROMPT, ConPredII                                               [51, 52]
Rules based              TOPRED1-2, SOSUI, KKD                                            [37, 91-93]
Dynamic programming      MEMSAT                                                           [44]
Alignment based          DAS, TMAP                                                        [39, 40]
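
The hydropathy-threshold heuristic can be illustrated with a minimal Python sketch. It is not the TOPRED implementation: it uses the standard Kyte-Doolittle hydropathy scale, and the window size, threshold, minimum segment length, flank length, and all function names are illustrative assumptions. The second helper applies a simplified 'positive inside' comparison by counting lysine and arginine residues in the two flanks of a candidate segment.

```python
# Kyte-Doolittle hydropathy values for the twenty standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of the window centred on each residue.

    Windows are truncated at the sequence ends (an assumption of this sketch).
    """
    half = window // 2
    profile = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half): i + half + 1]
        profile.append(sum(KD[aa] for aa in segment) / len(segment))
    return profile


def candidate_tm_segments(seq, window=19, threshold=1.6, min_len=15):
    """Return (start, end) pairs, end exclusive, of runs scoring >= threshold."""
    profile = hydropathy_profile(seq, window)
    segments, start = [], None
    # A sentinel below any threshold closes a run that reaches the C-terminus.
    for i, score in enumerate(profile + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments


def cytoplasmic_side(seq, segment, flank=15):
    """Simplified 'positive inside' rule: the flank richer in K/R is taken as cytoplasmic."""
    start, end = segment
    n_pos = sum(aa in "KR" for aa in seq[max(0, start - flank):start])
    c_pos = sum(aa in "KR" for aa in seq[end:end + flank])
    if n_pos == c_pos:
        return "ambiguous"
    return "N-terminal flank" if n_pos > c_pos else "C-terminal flank"
```

For a toy sequence such as "KKRDD" followed by twenty leucines and "DDDDD", calling candidate_tm_segments with a short window (for example window=5, threshold=2.0, min_len=10) reports a single hydrophobic run, and cytoplasmic_side assigns the positively charged N-terminal flank to the cytoplasm. Sequences containing non-standard residue codes would need to be cleaned first, since the lookup table covers only the twenty standard amino acids.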

Later developments have further explored the use of pat- 
terns present across databases of known proteins to identify 
useful statistical patterns for topological analysis. As with 
homology searches in genomic databases, hidden Markov 
models (HMMs) are a popular method to model the statistical 
properties of biological sequences. HMMs were developed 
for automated speech processing [41, 42], in which an observed
audio signal is produced by a set of unknown words correlated
with certain tonal patterns. The algorithm then statistically 
reconstructs the most likely word producing a given pattern 
of sounds over each time interval given these input prop- 
erties. Similarly, an observed distribution of amino acids 
may be considered as an observed 'signal', with the hidden 
states being topological descriptions (such as transmembrane 
helix or cytoplasmic loop), which produce different
distributions of observed amino acids [43]. This process, which cycles
through each amino acid to find the optimal series of 'states' 
that explain the observed pattern, resembles earlier dynamic 
programming approaches which sought to find an optimal 
topological prediction by iteratively building up
predictions from sub-sequences [44]. Additional complexity arises
from the fact that type I membrane proteins possess a signal
peptide directing them to the secretory pathway [45], a motif
that resembles a transmembrane helix and thus may be
misidentified by the algorithm. Thus, HMMs may be improved
by incorporating 'signal peptide' as one of their hidden states,
as implemented in the Phobius and signal peptide obtainer
of correct topologies for uncharacterized sequences
(SPOCTOPUS) programs [46, 47]. Other variations of this approach
are possible, such as the scale-based method for prediction of
integral membrane proteins (SCAMPI) program, which uses the
predicted free energy of amino acids as the 'observed state'
instead of the amino acids themselves [48]. Taken together,
these prediction methods have demonstrated remarkable 
accuracies of 80% -97% in discriminating soluble from mem- 
brane proteins and predicting transmembrane helices in retro- 
spective analyses' 4 ' 4 '. Additionally, as demonstrated by pre- 
viously described studies, these methods' practical utility has 
been proven in successful filtration of hit lists from unbiased
screens. 
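
To make the idea of decoding hidden topological states concrete, the following is a minimal Python sketch of Viterbi decoding for a toy two-state HMM ('loop' and 'helix'). It is not the model used by TMHMM, Phobius, or any other cited program: the state set, the reduction of residues to hydrophobic/polar symbols, and every probability below are illustrative assumptions rather than trained parameters.

```python
import math

# Assumed residue classes: these are treated as hydrophobic, everything else as polar.
HYDROPHOBIC = set("AILMFVWC")

STATES = ("loop", "helix")
# Illustrative (untrained) start, transition, and emission probabilities.
START = {"loop": 0.9, "helix": 0.1}
TRANS = {
    "loop": {"loop": 0.9, "helix": 0.1},
    "helix": {"loop": 0.1, "helix": 0.9},
}
EMIT = {
    "loop": {"H": 0.2, "P": 0.8},
    "helix": {"H": 0.8, "P": 0.2},
}


def viterbi(seq):
    """Return the most probable hidden-state path for an amino acid sequence."""
    if not seq:
        return []
    obs = ["H" if aa in HYDROPHOBIC else "P" for aa in seq]
    # score[s]: log-probability of the best path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[t][s]: best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Practical predictors use richer state architectures than this sketch, for example separate inside and outside loop states, and, as described above, an additional signal peptide state.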

These successes are particularly notable given the inherently 
challenging nature of the problem they tackle. Indeed, one of 
the complexities of predicting protein topology from biologi- 
cal sequence is the inherent dependency between position 
and structure. Neural networks (NNs) are another approach 
that seeks to represent these nonlinearities by mapping a set 
of inputs [such as position-specific scoring matrices (PSSMs) 
representing the likelihood of residues in particular positions 
over a sliding window of a protein structure] to topological 
states such as transmembrane helices [49]. This mapping is
performed by connecting the input data to the output through a
series of 'neurons,' a set of logistic functions whose
sigmoidal behavior in response to their inputs resembles
activation thresholds in the mammalian nervous system. The
observed topological states of a protein are thus modeled as a
weighted combination of nonlinear activation functions, and the
weights connecting the units are optimized to best reconstruct
the desired output. This approach may be used independently
or combined with other algorithms. For example, the
SPOCTOPUS program combines an NN and an HMM, using the output
from the NN as an input to the HMM [47], thus improving
inference of the 'hidden states' in the HMM.
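
The 'weighted combination of nonlinear activation functions' can be written down in a few lines. The sketch below is a forward pass through a network with one hidden layer of logistic units; the function names are hypothetical, the weights are supplied by the caller rather than trained, and the input vector merely stands in for a flattened PSSM window.

```python
import math


def logistic(x):
    """Sigmoidal activation function used by each 'neuron'."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Score one input window with a single-hidden-layer network.

    hidden_weights holds one weight list per hidden unit; the return value
    lies in (0, 1) and could be read as the score for a transmembrane state.
    """
    hidden = [
        logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```

Training, which is omitted here, would adjust the weights and biases so that the outputs best reconstruct the known topological states of a reference set.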

In addition to NNs, support vector machines (SVMs) have
also been utilized to perform a nonlinear mapping from
input sequences to topological states. The 'support vector'
in the name is derived from the fact that only a small subset
of the data used to develop the model is used to generate
parameters for future prediction. These 'support vectors' lie
at the boundary between the classes of data, such as
transmembrane helices and cytosolic loop sequences, which the
algorithm seeks to classify. A strength of this approach is that
the SVM may use a similarity (kernel) function, such as a
Gaussian or a polynomial function, to find a boundary separating
classes that may be intermingled in their original vector
space. Like NNs, SVMs applied to topology prediction
may also utilize as input a position-specific scoring matrix
(PSSM) for a sliding window over the protein sequence [50].
Generating this prediction over the whole length of the protein
thus yields a predicted topology.
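
The role of the support vectors and the similarity function can be seen in the standard kernel SVM decision function, sketched below in Python with a Gaussian (RBF) kernel. The support vectors, coefficients, bias, and the gamma value are made-up inputs for illustration; in practice they would come from training, and this sketch is not taken from MEMSAT-SVM.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian similarity between two feature vectors of equal length."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x, support_vectors, dual_coefs, bias=0.0, gamma=0.5):
    """Signed score for x; only the support vectors enter the sum.

    dual_coefs[i] is the product of the learned multiplier and the class
    label (+1 or -1) of support vector i.
    """
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias
```

A positive score would assign the window to one class (say, transmembrane helix) and a negative score to the other; points equidistant from two opposing support vectors with equal coefficients fall on the decision boundary.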

Given that each algorithm discussed above may have
scenarios in which it performs better or worse, it seems
reasonable to infer that combining some of these methods may
overcome their individual shortcomings. This sort of
combination has the benefit of offsetting weaknesses in a single
method and of potentially pooling weak evidence from
multiple predictions to yield stronger collective evidence. For
example, the consensus prediction (ConPred) algorithm uses a
heuristic rules system to average inputs from multiple
topology prediction methods to derive a consensus [51]. Similarly,
Bayesian prediction of membrane protein topology (BPROMPT)
uses a Bayesian belief network to combine evidence from five
methods into a consensus [52], modeling this consensus as a
'child' node that receives weighted inputs from the five 'parent'
methods.
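
As a simple illustration of pooling evidence, the sketch below combines per-residue topology strings from several predictors by weighted voting. This is deliberately simpler than either cited method: it implements neither the ConPred rule system nor the BPROMPT belief network, and the label alphabet ('i' inside, 'M' membrane, 'o' outside), the function name, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def consensus_topology(predictions, weights=None):
    """Combine equal-length per-residue label strings by weighted voting."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(predictions)
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    consensus = []
    for i in range(length):
        votes = Counter()
        for prediction, weight in zip(predictions, weights):
            votes[prediction[i]] += weight
        # Ties are broken alphabetically so that the result is deterministic.
        consensus.append(max(sorted(votes), key=lambda label: votes[label]))
    return "".join(consensus)
```

Unequal weights allow a method with better retrospective accuracy to count for more, which is the same intuition that motivates the weighted inputs of a belief network.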

The previously described algorithms, whether they employ 
amino acid frequencies, hydropathy, or folding free energy, 
primarily use information derived from the linear, primary 
structure of amino acid sequences. The resulting topology 
gives a 'flat' inference for tertiary or quaternary structure, but 
little guidance as to how the resulting helices are organized in 
a three-dimensional space. Such challenges have prompted 
the development of algorithms that build on two-dimensional
topological predictions to infer three-dimensional coordinates
from linear amino acid sequences, utilizing the
population of previously solved X-ray crystal structures of
membrane proteins to generate homology-based predictions [53-55].
In the
absence of gold-standard structural data for most membrane 
proteins and channels, such techniques may represent the 
next-best option for tasks such as virtual small-molecule dock- 
ing that require three-dimensional coordinates. 

Functional sub-classification of transmembrane protein 
classes 

After membrane proteins are identified and separated from 
soluble proteins using the topology prediction programs 
outlined above, the second level of classification in Figure 1 
involves grouping the population of membrane proteins into
individual functional classes and prospectively identifying the
function of uncharacterized genes. Several methods have been
reported to accomplish this task, which are summarized in 
Table 3 and visually diagrammed in Figure 2. 



Table 3. Algorithms for predicting functional class of membrane proteins.

Features                                  Algorithms                                        References
Amino acid composition                    DISC-FUNCTION, VGIChan, VKCPred                   [56-58]
Dipeptide composition                     Transporter-RBF, VGIChan, ionchanPred, VKCPred    [56, 57, 60, 61]
PSI-BLAST PSSM                            Transport targets, transporter-RBF                [61, 62]
Amino acid descriptors                    Transport targets, transporter-RBF                [61, 62]
Pfam domains, gene ontology annotation    TransportTP                                       [68]


As with many topology prediction algorithms, these
methods often require the amino acid sequence to be summarized
in a quantitative fashion to compare two proteins. One such
descriptor that has been successfully utilized is the fraction of
a protein's sequence composed of each of the twenty naturally
occurring amino acids, a vector of length twenty that sums
to one and is termed the 'amino acid composition' [56, 57]. The
intuition behind this descriptor is that distinct classes of
membrane proteins have a bias to include particular amino acids
at greater frequency due to the structural requirements or
constraints for their function. Refinements of the amino acid
composition descriptor have also been proposed, such as using the
un-normalized count of the twenty amino acids in a protein
sequence, a method reported to be more effective as it also
captures differences in the characteristic length of a protein
family [58]. Similarly, expanding the normalized amino acid
composition to a vector of length sixty (twenty elements for the
composition of the whole protein, and twenty elements each for
the amino acid composition of transmembrane and
non-transmembrane segments) has also allowed better
discrimination [59]. Like amino acid composition, dipeptide
frequencies have also been successfully utilized as descriptors
to discriminate membrane proteins of different classes
[56, 57, 60]. The previously mentioned PSSMs derived from
Position-Specific Iterative BLAST (PSI-BLAST), which measure the
likelihood of a substitution from the observed to an alternate
amino acid at a particular position based on substitution
patterns between a protein and its homologous neighbors, have
also been found to have high sensitivity as a descriptor [61].
More abstractly, numerical descriptors of folding energetics
have also been employed in predictive models [61].
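
The composition descriptors are straightforward to compute. The following Python sketch derives the twenty-element amino acid composition and the 400-element dipeptide composition of a sequence; the function names and the decision to skip non-standard residue codes are assumptions of this example rather than details of the cited tools.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of the sequence made up by each standard amino acid.

    Returns a list of twenty values, ordered as AMINO_ACIDS, that sums to one.
    """
    residues = [aa for aa in seq if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of adjacent residue pairs matching each of the 400 dipeptides."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    return {key: pairs.count(key) / len(pairs) for key in keys}
```

The un-normalized variant mentioned above corresponds to dropping the division by the sequence length, and the sixty-element descriptor to concatenating three such vectors computed on the whole protein, its predicted transmembrane segments, and the remaining residues.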

Just as the input descriptors to these algorithms are varied, 
so are the kinds of functional predictions produced in these 
studies. Several methods have been used to predict a query 
gene's family membership, such as discriminating channels,
transporters, and carriers from one another [58]. In greater
detail, these methods have also been used to predict a protein's
substrate, such as different metal ions for channels or
protein/nucleic acids for transporters [62]. Predictions have
also been targeted at functional parameters specific to
particular classes of membrane proteins. For example, amino acid
sequence has been used to predict the half-maximal activation
potential of voltage-gated channels [63], discriminate between
channels based on their electrophysiological parameters [64], or
identify channels that may serve as promising therapeutic
targets [65].

These previously described methods, in essence, rely on the 
proximity of a query protein to a neighborhood of known pro- 
teins in the space of the descriptor used. Further refinements 
have been proposed, where this proximity measurement may 
be combined with other features such as Gene Ontology terms 
describing the biological processes, molecular functions, and
subcellular localization of a protein [66], presence of
class-associated protein families (Pfam) domains [67], or the
number of predicted transmembrane domains [68]. The resulting
combination of annotated and raw sequence information may then
be used in a prediction algorithm such as the previously
discussed SVM [68]. Indeed, the ability of amino acid profiles to
serve as relevant features for identifying functionally related
proteins may suggest that families share specific motifs, and
specific structural fragments and motifs have also been
identified in related studies [69, 70].
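
The notion of proximity in descriptor space can be sketched with a k-nearest-neighbour vote. This is a stand-in chosen for brevity, not the classifier used by the cited tools; the function name, the Euclidean distance, and the value of k are assumptions, and the vectors could be any of the descriptors discussed above, such as amino acid composition.

```python
import math
from collections import Counter


def nearest_neighbor_class(query, references, k=3):
    """Predict a functional class from the k closest annotated proteins.

    references is a list of (descriptor_vector, class_label) pairs whose
    vectors have the same length as the query vector.
    """
    if not references:
        raise ValueError("at least one annotated reference is required")
    ranked = sorted(references, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Additional features such as Gene Ontology terms, Pfam domain flags, or predicted transmembrane segment counts could be appended to the descriptor vector before computing distances, although mixing features on different scales would normally call for some form of normalization.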

Expanding these predictions based on two-dimensional 
structure correlated with classifications or functional param- 
eters, methods have also been developed to directly infer func- 
tion based on a three dimensional conformation. For example, 
the SLITHER program uses molecular modeling simulations 
to predict whether a putative substrate molecule may
permeate the cavities or channels in a protein structure [71]. In
cases where the existence of a channel in a protein is
unverified, the MolAxis program can be used to predict whether
one exists using computational geometry [72]. Obviously, both of
these
methodologies require three-dimensional protein coordinates 
which are experimentally unavailable for most channels or 
other membrane proteins, but might be combined with homol- 
ogy-based three dimensional structure predictions described 
in the previous section to generate functional predictions for 
inferred three dimensional structures. 

A related functional prediction is to identify the natural
ligand, ion substrate or protein interaction partner of the novel
proteins. The challenge is highlighted by the large number of
orphan seven-transmembrane receptors, where the natural binding
partner(s) of some otherwise well-characterized transmembrane
receptors such as BRS-3 remain unknown [73]. Though not
specifically developed to identify peptide-receptor interactions
in silico, large-scale predictions of protein-protein
interactions have been described using two- and
three-dimensional information [74-76].
Conceivably, such algorithms might be employed to identify
novel interactions between peptide ligands and the subset of
peptide-binding receptors. Direct bioinformatics identification
of ligands such as neuropeptide precursors has also benefited
from the increased availability of genome-wide proteomic
and nucleotide data, as demonstrated by the computational
prediction of more than 200 novel neuropeptides in the
honeybee Apis mellifera, of which 100 were validated using
peptidomics [77]. Related studies of the red flour beetle
Tribolium castaneum have employed homology analysis to validate
30/41 predicted neuropeptide genes using mass spectrometry data,
encoding 71 peptides [78]. Given the accuracy of the predictions
in these studies using large genomics datasets, we speculate
that such methods and information provide a promising pool
of potential novel ligands that might be screened in functional
assays against putative peptide-binding receptors.

A reference map of uncharacterized membrane proteins 

In the previous sections we have provided an overview of 
experimental and computational methodologies used to de- 
orphanize uncharacterized membrane proteins. Here we 
quantify how much of the transmembrane proteome has been 
characterized, and whether the coverage of the characterized 
regions is biased toward proteins with a particular topology 
by generating a reference map of the human transmembrane 
proteome. 

This analysis is based on 35 879 unique human RefSeq pro- 
tein sequences downloaded from NCBI as GenBank records. 
To reduce bias in our analysis resulting from proteins with 
multiple isoforms, we collapsed this collection into unique 
gene symbols by retaining only the entry (for a given gene 
symbol) with the greatest number of annotated transmem- 
brane segments among annotated sites in their GenBank 
fields, under the hypothesis that the sequence with the most 
annotated segments represents the most studied and highest- 
quality record for a particular gene. In cases where the gene 
has no transmembrane helices we simply kept the first occurring
entry. Applying this filter left 19977 sequences. Because
uncharacterized membrane proteins may lack annotated 
transmembrane segments, we utilized several of the previ- 
ously described topology prediction programs to generate an 
estimated transmembrane segment count for these orphan 
proteins. The three programs used were TMHMM2.0 [43],
SCAMPI_multi [79], and PHOBIUS1.0.1 [46], and the weights
used to average the predictions were estimated using a linear 
regression against a count of known transmembrane seg- 
ments. 
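
The two filtering steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline used for the analysis: the record layout, the gene and isoform names, and the numeric weights are hypothetical, and the rounding of the weighted sum to an integer count is an assumption (the text states only that the weights were estimated by linear regression).

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (gene_symbol, isoform_id, annotated_tm_segments).
Record = Tuple[str, str, int]


def collapse_to_genes(records: Iterable[Record]) -> Dict[str, Record]:
    """Keep one record per gene symbol: the one with the most annotated
    transmembrane segments. Ties, including genes with no annotated
    segments, keep the first record encountered."""
    best: Dict[str, Record] = {}
    for rec in records:
        symbol, _, n_tm = rec
        if symbol not in best or n_tm > best[symbol][2]:
            best[symbol] = rec
    return best


def consensus_tm_count(predictions: List[float], weights: List[float]) -> int:
    """Weighted combination of per-program segment counts.
    Rounding to the nearest non-negative integer is an assumption."""
    total = sum(w * p for w, p in zip(weights, predictions))
    return max(0, round(total))


# Toy input with invented gene symbols and isoform identifiers.
records = [
    ("GENE_A", "A.iso1", 0),
    ("GENE_A", "A.iso2", 7),
    ("GENE_A", "A.iso3", 6),
    ("GENE_B", "B.iso1", 0),
    ("GENE_B", "B.iso2", 0),
]
genes = collapse_to_genes(records)

# Hypothetical weights for three predictors; the fitted values are not
# reported in the text.
weights = [0.4, 0.3, 0.3]
estimate = consensus_tm_count([7, 6, 7], weights)
print(genes["GENE_A"][1], genes["GENE_B"][1], estimate)
```

A protein would then count as an estimated membrane protein when either its annotated or its consensus segment count is at least one.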

We estimated the number of membrane proteins using a cri- 
terion of one or more predicted or annotated transmembrane 
segments. This analysis yielded 4991 of the 19977 sequences 
for unique genes passing this filter, corresponding to ~25% of 
the genome, a value in reasonable alignment with previous 
estimates [3]. To determine which of these 4991 membrane proteins
were previously unannotated, we used two approaches.
First, we selected a list of all RefSeq sequences lacking a Gene 
Reference into Function (GeneRIF) annotation, giving a set of 
5723 unique proteins. While this filter can identify sequences 
that have previously been annotated for function, the lack of a 
hierarchy or sub-classification of these annotations by strength 
of evidence means that some of these sequences may actually
be effectively uncharacterized. By manually examining many
entries, we have indeed found that some GeneRIF entries 
describe presumed or inferred function without experimental 
support. While these may be useful for generating hypoth- 
eses, this ambiguity complicates our estimate of the number 
of uncharacterized membrane proteins. Thus, we also uti- 
lized the independent annotation in the Gene Ontology (GO) 
database. Following a similar methodology used to identify 
uncharacterized proteins in Arabidopsis thaliana 1 , we identi- 
fied all proteins either lacking GO annotation (2983 proteins) 
or having no data (ND) evidence code for Molecular Function 
(MF) annotation at the root node (the default assignment in the 
GO for uncharacterized proteins) (597 proteins), giving a total 
of 3580. These intersect with the GeneRIF-based set by 2431. 
The union of the uncharacterized sets gives 6872 proteins, of 
which ~25% (1533) are transmembrane. In contrast, only 216 
of the intersecting set of 2431 are in our estimated transmem- 
brane set, so we used the union of the estimated uncharacter- 
ized sets as a less conservative approach. A summary of all 
filters applied is given in Figure 3A. Many of the 4991 estimated
membrane proteins in this analysis (3791, ~76%) have
GO annotations for MF, including 1479 unique terms (as a 
single protein may have more than one MF annotation). The 
distribution of all MF terms assigned to more than ten proteins 
(167 terms) is shown in Figure 3B, indicating that G-protein 
coupled receptors, olfactory receptors, nucleotide binding 
receptors, and calcium interacting proteins dominate this list. 
To independently evaluate the quality of our inference, we 
used the same approach to predict the number of transmem- 
brane proteins in Saccharomyces cerevisiae. The localization of 
approximately 75% of the yeast genome has been experimen- 
tally assessed using Green Fluorescent Protein (GFP)-tagged 
fusion proteins to determine presence/absence at twenty-two

Figure 3. Estimating the number of uncharacterized human membrane 
proteins. (A) Human RefSeq protein sequences are collapsed to unique 
genes. Three topology prediction algorithms are averaged to generate 
a list of predicted membrane proteins, and merged with membrane 
proteins derived from GenBank transmembrane helix annotations to 
yield a combined population of estimated membrane proteins. Previous 
functional annotations are evaluated using GeneRIF fields and Gene 
Ontology (GO) records, which are merged to yield a combined population 
of estimated uncharacterized proteins. The intersection of the membrane 
and uncharacterized populations represent uncharacterized membrane 
proteins. (B) Distribution of top GO molecular function (MF) categories for 
all membrane proteins. 

organelle sites [81], and we used this information to assess the
accuracy of TM protein predictions. These analyses, shown in 
Figure 4, demonstrate that the predicted transmembrane proteins,
which constitute ~20% of the yeast genome, are experimentally
localized in the Endoplasmic Reticulum (ER), secretion
pathway (vacuole) and cell periphery at higher rates (18-, 13-,
and 7-fold, respectively) than predicted soluble proteins,
whose localization records are biased for the cytoplasm and
nucleus (5-fold and 7.5-fold enrichment, respectively). While
the discrimination is not perfect, the population of predicted 
TM proteins in yeast obtained using the predictive methodol- 
ogy from the human analysis is enriched for experimentally 
annotated localization at membrane sites, supporting the use 
of these topological predictions as a proxy for TM localization. 
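
The counts reported for the human uncharacterized sets earlier in this section follow from simple set arithmetic, which can be checked with the short sketch below. All numbers are copied from the text; the function and variable names are our own illustrative choices.

```python
def union_size(size_a: int, size_b: int, size_both: int) -> int:
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|."""
    return size_a + size_b - size_both


# Counts as reported in the text.
no_generif = 5723          # RefSeq proteins lacking a GeneRIF annotation
no_go = 2983               # proteins lacking any GO annotation
go_no_data = 597           # proteins with the ND evidence code at the MF root
overlap = 2431             # proteins in both the GeneRIF- and GO-based sets
membrane_total = 4991      # estimated membrane proteins
unchar_membrane = 1533     # uncharacterized proteins that are transmembrane

go_based = no_go + go_no_data
uncharacterized = union_size(no_generif, go_based, overlap)
membrane_share = unchar_membrane / membrane_total

print(go_based)                  # 3580
print(uncharacterized)           # 6872
print(f"{membrane_share:.1%}")   # 30.7%
```

The last figure is the share of estimated membrane proteins that fall in the uncharacterized union.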

To gain a global overview of the distribution of the anno- 
tated and unannotated membrane proteins identified in our 
analysis, we generated a vector description of each sequence to
allow systematic comparison. The first twenty elements of this
vector contain the count of each of the twenty naturally occurring
amino acids in the whole protein sequence. The next twenty
contain these counts restricted to the transmembrane regions, 
while the last twenty contain the counts for the cytosolic loops, 
giving a total length of sixty. All counts were calculated using 
the topological prediction output of TMHMM2.0 for trans- 
membrane segments for consistency. The resulting vectors of 
length sixty were then embedded in a low-dimensional map 
using t-Stochastic Neighbor Embedding (t-SNE), an algorithm 
that produces coordinate maps of high-dimensional data 
which represent the pairwise similarity between objects [82].
This algorithm, compared to other nonlinear methods, has
been shown to better separate images of handwritten digits
and facial photographs into distinct clusters in two-dimensional
space [82]. To separate the resulting map into regions,
we clustered the t-SNE coordinates using affinity
propagation [83], based on the squared Euclidean distance
between the t-SNE coordinates, with the maximum pairwise
distance as the input preference for each datapoint to be a
cluster center. The resulting map for membrane proteins with
previously annotated function is displayed in Figure 5A, with 
colors representing the clusters defined by affinity propaga- 
tion. The distribution of the estimated set of uncharacterized 
membrane proteins is shown in Figure 5B. While the range of 
space covered by the characterized and uncharacterized sets 
is comparable, the density of the uncharacterized membrane 
proteins is concentrated in a region of space occupied by seven 
transmembrane segment receptors (Figures 5D and 6), reflecting
orphan olfactory receptors. This difference in density is borne
out quantitatively by the modest correlation coefficient of 0.20
between the grid-cell counts of the uncharacterized and
characterized sequences in Figure 6.
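
The sixty-element descriptor described above can be illustrated with a minimal sketch. The per-residue topology string used here ('M' for membrane, 'i' for cytosolic, 'o' for non-cytosolic) is a simplified, assumed input format and not the actual output format of TMHMM2.0; the function names and the toy sequence are likewise hypothetical.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def composition(seq: str) -> List[int]:
    """Count each of the twenty amino acids, in a fixed order."""
    counts = Counter(seq)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def feature_vector(sequence: str, topology: str) -> List[int]:
    """Concatenate whole-sequence, transmembrane, and cytosolic counts.

    `topology` is an assumed per-residue label string of the same length
    as `sequence`: 'M' = membrane, 'i' = cytosolic, 'o' = non-cytosolic.
    """
    if len(sequence) != len(topology):
        raise ValueError("sequence and topology must have equal length")
    tm = "".join(aa for aa, t in zip(sequence, topology) if t == "M")
    cyto = "".join(aa for aa, t in zip(sequence, topology) if t == "i")
    return composition(sequence) + composition(tm) + composition(cyto)


# Toy example: a 12-residue sequence with one six-residue membrane segment.
seq = "MKLLAVVAGGRK"
top = "iiiMMMMMMooo"
vec = feature_vector(seq, top)
print(len(vec))  # 60
```

Vectors of this form, one per protein, would be the input to the embedding step; residues outside the twenty standard amino acids are simply ignored in this sketch.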

While this analysis can distinguish broad structural classes 
of membrane proteins, as shown by the spatial localization of 
seven transmembrane segment proteins (Figure 5D), voltage-gated sodium
and calcium channels are also intermingled with transport- 
ers in the lower left quadrant. While these proteins might be 
topologically similar, they are clearly functionally distinct. It 
thus remains unclear which class of sequence descriptors, if 
any, can best capture functional differences in this kind of 
analysis, and how to evaluate the accuracy of such features. 
From the perspective of future de-orphanization, it appears 
encouraging that the TMEM class of proteins is broadly dis- 
tributed across the sequence space, suggesting that membrane 
proteins of many functional or topological classes may yet be 
elucidated. 

Perspective 

Review of the literature suggests a 'gap' between experimental 
and computational methods. While in silico functional predic- 
tions are primarily verified through retrospective accuracy, 
experimental studies with unbiased genomics approaches 
use bioinformatics as a way to pare down a candidate list, 
rather than restrict and guide the initial search space. Thus, 
we anticipate there are unrealized opportunities for predictive 
Figure 4. Subcellular localization of predicted membrane and soluble proteins in S. cerevisiae. GFP fusions of individual yeast proteins have been
expressed, localized and annotated [81]. (A) Analytical pipeline for prediction of yeast membrane proteins beginning with 5909 RefSeq entries that are
filtered to yield 4973 unique gene names. Topology algorithms for unique genes yield 920 putative membrane proteins and 4053 putative
soluble proteins. Fractions of both groups possess experimentally determined subcellular locations. (B) The distribution of experimentally determined 
localization(s) for predicted membrane and soluble proteins in (A) among 22 cellular sites. Bar lengths are normalized to the total number of subcellular 
location sites available for predicted membrane and soluble proteins from (A). 



algorithms to be used to identify novel membrane proteins 
and suggest possible phenotypes for functional validation. 
Furthermore, such studies will help computational researchers 
to better understand which models and descriptions of protein 
structure are most successful in predicting the results of these 
experimental validations, and thus iteratively improve the 
underlying bioinformatics algorithms. 

We also speculate that predictions of three-dimensional 
structure have not been fully exploited for these kinds of stud- 
ies. Indeed, the small fraction of transmembrane drug targets 
with crystal structures derived from DrugBank [84] indicated
in Table 4 suggests that this is a role in which bioinformatics 
may fill a large existing knowledge gap. As more membrane 



Table 4. Structural characterization of human drug targets.

Description                          Total number   Soluble         Transmembrane
Unique genes                         19 977         14 986 (75%)    4 991 (25%)
Unique drug targets                  2 048          1 357 (68%)     636 (32%)
Unique drug targets with structure   791            555 (72%)       216 (28%)

Note: The 'total number' column doesn't add up to the soluble and
transmembrane counts for the last two rows because the gene symbols in
DrugBank don't map to 100% of the RefSeq entries.



proteins are crystallized and homology-based three dimen- 
sional coordinate prediction methods become more mature, 
it is intriguing to speculate that tertiary structure predictions 
might generate functional predictions using substrate dock- 
ing, in a manner similar to virtual screening of small molecule 
ligands. Such approaches might complement existing predic- 
tors based on amino acid sequence alone. 
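
As a brief worked check of Table 4, the sketch below recomputes the soluble/transmembrane percentages from the tabulated counts and derives the fraction of mapped drug targets in each class that have a structure. The counts are copied from the table; the structural coverage ratios are our own derived quantities and are not reported in the text.

```python
from typing import Dict, Tuple

# (soluble, transmembrane) counts copied from Table 4.
TABLE_4: Dict[str, Tuple[int, int]] = {
    "Unique genes": (14986, 4991),
    "Unique drug targets": (1357, 636),
    "Unique drug targets with structure": (555, 216),
}


def percent_split(soluble: int, transmembrane: int) -> Tuple[int, int]:
    """Percentages relative to the mapped total (soluble + transmembrane)."""
    mapped = soluble + transmembrane
    return round(100 * soluble / mapped), round(100 * transmembrane / mapped)


splits = {row: percent_split(*counts) for row, counts in TABLE_4.items()}


def structural_coverage(column: int) -> float:
    """Fraction of mapped drug targets in a column (0 = soluble,
    1 = transmembrane) that have a structure."""
    with_structure = TABLE_4["Unique drug targets with structure"][column]
    all_targets = TABLE_4["Unique drug targets"][column]
    return with_structure / all_targets


soluble_coverage = structural_coverage(0)
tm_coverage = structural_coverage(1)
print(splits)
print(f"{soluble_coverage:.2f} {tm_coverage:.2f}")
```

Percentages are taken relative to the mapped counts rather than the 'total number' column, consistent with the note to Table 4.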

An additional challenge comes from the fact that deorpha- 
nization often involves identification of unknown functions. 
While many of the experimental studies discussed
here have sought to generate candidate lists based on a specific
phenotype, the challenge may often lie in assessing a
completely unknown function. Even in cases where
bioinformatics has correctly identified a novel protein, such
as Ci-VSP, which contains both a voltage-sensing domain and
a phosphatase catalytic domain [18], the substrate of this enzyme,
and thus its biological role, was not immediately apparent
from the initial characterization. Therefore, in the absence of
functional knowledge - for example, lack of knowledge of a 
channel's presumed triggering stimulus that generates current 
- modified screening approaches will be needed to probe the 
function of unannotated membrane proteins. Ion channels as 
a class share the properties of conducting ionic currents, a gen- 
eral feature that may be exploited. Recent innovations in high- 
throughput patch clamping may allow a matrix of different 
Figure 5. Landscape of human membrane protein diversity. Two-dimensional embedded coordinates are generated from vectors counting the number
of each of the twenty amino acids in a whole protein sequence, transmembrane segments, and cytosolic segments for 4991 estimated human 
membrane proteins, using the t-stochastic neighbor embedding (t-SNE) algorithm. Colors represent groups identified by applying affinity propagation 
clustering to the embedded coordinates. (A) Embedded coordinates and cluster identity of subset of human membrane proteins with previous 
functional annotation. (B) As in (A), for uncharacterized membrane proteins. (C) As in (A), for TMEM proteins. (D) As in (A), for sequences with seven 
transmembrane segments as denoted by RefSeq annotations or averaged predictions of three topology algorithms. 
Figure 6. Density profile of landscape of human membrane proteins. Embedded two-dimensional coordinates generated from vectors containing
counts of each of the twenty amino acids in a whole protein sequence, membrane segments, and cytosolic segments, using the t-stochastic neighbor 
embedding (t-SNE) algorithm for 4991 estimated human membrane proteins. (A) Count per coordinate grid representing the number of sequences 
(colorbar) for the subset of human membrane proteins with previous annotation. (B) As in (A), for uncharacterized membrane proteins. 



potential stimuli and buffer conditions to be tested, allowing
rapid functional profiling. Such approaches, combined with
high-throughput imaging to determine localization, have the
potential to systematize the characterization of novel membrane
proteins.

In summary, while the characterization of the transmem- 
brane genome has witnessed many informatics and experi- 
mental successes, our analysis shows that almost one-third 
of the membrane proteins still lack functional annotation. 
Given the current seeming lack of overlap between bioin- 
formatics and unbiased screening approaches, we speculate 
there are opportunities for predictive algorithms to further 
refine screening studies, and for new profiling technologies 
to validate these predictive algorithms. This combination of 
smarter analytics and broader experimental methodology may 
thus help deorphanize the remaining membrane proteins in 
the genome, offering potential drug targets as well as greater 
understanding of these genes' biological roles. 